AITopics

Genre: Research Report > Experimental Study (1.00)

Industry:

Media (0.67)
Leisure & Entertainment (0.46)
Health & Medicine (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)
Information Technology > Artificial Intelligence > Speech (0.67)

Neural Information Processing SystemsJun-12-2026, 14:03:50 GMT

Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments

Target speaker extraction focuses on isolating a specific speaker's voice from an audio mixture containing multiple speakers. To provide information about the target speaker's identity, prior works have utilized clean audio samples as conditioning inputs. However, such clean audio examples are not always readily available. For instance, obtaining a clean recording of a stranger's voice at a cocktail party without leaving the noisy environment is generally infeasible. Limited prior research has explored extracting the target speaker's characteristics from noisy enrollments, which may contain overlapping speech from interfering speakers. In this work, we explore a novel enrollment strategy that encodes target speaker information from the noisy enrollment by comparing segments where the target speaker is talking (Positive Enrollments) with segments where the target speaker is silent (Negative Enrollments). Experiments show the effectiveness of our model architecture, which achieves over 2.1 dB higher SI-SNRi compared to prior works in extracting the monaural speech from the mixture of two speakers. Additionally, the proposed two-stage training strategy accelerates convergence, reducing the number of optimization steps required to reach 3 dB SNR by 60\%. Overall, our method achieves state-of-the-art performance in the monaural target speaker extraction conditioned on noisy enrollments.

artificial intelligence, proceedings, target speaker, (6 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning (0.59)

Neural Information Processing SystemsFeb-13-2026, 01:16:25 GMT

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

Neural Information Processing Systems http://nips.cc/

speaker encoder, speech, utterance, (15 more...)

Country:

North America > United States (0.04)
North America > Canada > Quebec > Montreal (0.04)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.66)

Neural Information Processing SystemsNov-20-2025, 17:03:52 GMT

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, speech, (19 more...)

Country:

North America > United States (0.04)
North America > Canada > Quebec > Montreal (0.04)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.88)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.66)

arXiv.org Artificial IntelligenceNov-11-2025

ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction

Wu, Wenxuan, Wang, Shuai, Wu, Xixin, Meng, Helen, Li, Haizhou

Audio-visual target speaker extraction (AV-TSE) models primarily rely on visual cues from the target speaker. However, humans also leverage linguistic knowledge, such as syntactic constraints, next word prediction, and prior knowledge of conversation, to extract target speech. Inspired by this observation, we propose ELEGANCE, a novel framework that incorporates linguistic knowledge from large language models (LLMs) into AV-TSE models through three distinct guidance strategies: output linguistic constraints, intermediate linguistic prediction, and input linguistic prior. Comprehensive experiments with RoBERTa, Qwen3-0.6B, and Qwen3-4B on two AV-TSE backbones demonstrate the effectiveness of our approach. Significant improvements are observed in challenging scenarios, including visual cue impaired, unseen languages, target speaker switches, increased interfering speakers, and out-of-domain test set. Demo page: https://alexwxwu.github.io/ELEGANCE/.

artificial intelligence, large language model, natural language, (18 more...)

2511.06288

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Ishikura, Seiya, Yamada, Hiroaki, Hiraoka, Tatsuya, Yamada, Hiroaki, Tokunaga, Takenobu

Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM

arXiv.org Artificial IntelligenceOct-30-2025

This study proposes augmenting dialog data with think-aloud utterances (TAUs) for modeling individual personalities in text chat by LLM. TAU is a verbalization of a speaker's thought before articulating the utterance. We expect "persona LLMs" trained with TAU-augmented data can mimic the speaker's personality trait better. We tested whether the trained persona LLMs obtain the human personality with respect to Big Five, a framework characterizing human personality traits from five aspects. The results showed that LLMs trained with TAU-augmented data more closely align to the speakers' Agreeableness and Neuroticism of Big Five than those trained with original dialog data. We also found that the quality of TAU-augmentation impacts persona LLM's performance.

large language model, machine learning, utterance, (18 more...)

2510.09158

Country:

Asia (1.00)
North America > United States (0.28)
North America > Mexico (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.52)

arXiv.org Artificial IntelligenceOct-13-2025

Target speaker anonymization in multi-speaker recordings

Tomashenko, Natalia, Yamagishi, Junichi, Wang, Xin, Liu, Yun, Vincent, Emmanuel

Most of the existing speaker anonymization research has focused on single-speaker audio, leading to the development of techniques and evaluation metrics optimized for such condition. This study addresses the significant challenge of speaker anonymization within multi-speaker conversational audio, specifically when only a single target speaker needs to be anonymized. This scenario is highly relevant in contexts like call centers, where customer privacy necessitates anonymizing only the customer's voice in interactions with operators. Conventional anonymization methods are often not suitable for this task. Moreover, current evaluation methodology does not allow us to accurately assess privacy protection and utility in this complex multi-speaker scenario. This work aims to bridge these gaps by exploring effective strategies for targeted speaker anonymization in conversational audio, highlighting potential problems in their development and proposing corresponding improved evaluation methodologies.

artificial intelligence, machine learning, target speaker, (15 more...)

2510.09307

Country: Asia > Japan (0.28)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Fortier, Alexandrine, Joshi, Sonal, Thebaud, Thomas, Villalba, Jesús, Dehak, Najim, Cardinal, Patrick

Multi-Target Backdoor Attacks Against Speaker Recognition

arXiv.org Artificial IntelligenceOct-10-2025

--In this work, we propose a multi-target backdoor attack against speaker identification using position-independent clicking sounds as triggers. T o simulate more realistic attack conditions, we vary the signal-to-noise ratio between speech and trigger, demonstrating a trade-off between stealth and effectiveness. We further extend the attack to the speaker verification task by selecting the most similar training speaker--based on cosine similarity--as a proxy target. The attack is most effective when target and enrolled speaker pairs are highly similar, reaching success rates of up to 90% in such cases. In recent years, speaker recognition systems have achieved strong performance. However, they remain susceptible to significant security risks, including malicious attacks [1]-[6]. Due to constraints in data and computational resources, many organizations rely on external providers for model development or data collection. A particularly concerning threat is backdoor attacks, which are introduced during training. The backdoor itself is a hidden mechanism the model learns during training: when a specific input pattern--known as a trigger--is present, the model consistently produces a target output, regardless of the true input.

machine learning, natural language, pattern recognition, (19 more...)

2508.08559

Country: North America (0.28)

Genre: Research Report (0.65)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.62)

arXiv.org Artificial IntelligenceAug-26-2025

WildSpoof Challenge Evaluation Plan

Wu, Yihan, Jung, Jee-weon, Shim, Hye-jin, Cheng, Xin, Wang, Xin

The WildSpoof Challenge aims to advance the use of in-the-wild data in two intertwined speech processing tasks. It consists of two parallel tracks: (1) Text-to-Speech (TTS) synthesis for generating spoofed speech, and (2) Spoofing-robust Automatic Speaker Verification (SASV) for detecting spoofed speech. While the organizers coordinate both tracks and define the data protocols, participants treat them as separate and independent tasks. The primary objectives of the challenge are: (i) to promote the use of in-the-wild data for both TTS and SASV, moving beyond conventional clean and controlled datasets and considering real-world scenarios; and (ii) to encourage interdisciplinary collaboration between the spoofing generation (TTS) and spoofing detection (SASV) communities, thereby fostering the development of more integrated, robust, and realistic systems.

artificial intelligence, machine learning, participant, (16 more...)

2508.16858

Country: Europe (0.15)

Genre: Research Report (0.40)

Industry: Information Technology > Security & Privacy (0.30)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.73)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.36)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.36)

Itani, Malek, Graves, Ashton, Eskimez, Sefik Emre, Gollakota, Shyamnath

Neural Speech Extraction with Human Feedback

arXiv.org Artificial IntelligenceAug-6-2025

We present the first neural target speech extraction (TSE) system that uses human feedback for iterative refinement. Our approach allows users to mark specific segments of the TSE output, generating an edit mask. The refinement system then improves the marked sections while preserving unmarked regions. Since large-scale datasets of human-marked errors are difficult to collect, we generate synthetic datasets using various automated masking functions and train models on each. Evaluations show that models trained with noise power-based masking (in dBFS) and probabilistic thresholding perform best, aligning with human annotations. In a study with 22 participants, users showed a preference for refined outputs over baseline TSE. Our findings demonstrate that human-in-the-loop refinement is a promising approach for improving the performance of neural speech extraction.

artificial intelligence, deep learning, machine learning, (19 more...)

2508.03041

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)